Is the Bellman residual a bad proxy?
Authors
Abstract
This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual ‖T∗v_π − v_π‖_{1,ν} over policies. A theoretical analysis shows how good a proxy this is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed to study the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy for policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe this question is worth considering.
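To make the two criteria concrete, here is a minimal sketch (not the authors' code) that evaluates both on a small tabular MDP with known dynamics: the mean value J_ν(π) = ν^T v_π, to be maximized, and the residual ‖T∗v_π − v_π‖_{1,ν}, to be minimized. All names (P, R, gamma, nu, pi) and the toy MDP are illustrative assumptions.

```python
import numpy as np

def policy_value(P, R, gamma, pi):
    """Exact v_pi via the linear system (I - gamma * P_pi) v = r_pi."""
    n_states, _ = R.shape
    P_pi = np.einsum('sa,san->sn', pi, P)   # state transition matrix under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected immediate reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def mean_value(v_pi, nu):
    """Criterion (i): mean value J_nu(pi) = nu^T v_pi (maximized by policy search)."""
    return nu @ v_pi

def bellman_residual_l1(P, R, gamma, v_pi, nu):
    """Criterion (ii): ||T* v_pi - v_pi||_{1,nu}, with
    (T* v)(s) = max_a [ R(s,a) + gamma * sum_s' P(s,a,s') v(s') ]."""
    q = R + gamma * np.einsum('san,n->sa', P, v_pi)   # Q-values w.r.t. v_pi
    t_star_v = q.max(axis=1)                          # optimal Bellman operator applied to v_pi
    return nu @ np.abs(t_star_v - v_pi)

# Toy usage: random MDP with 3 states, 2 actions, uniform random policy.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
R = rng.uniform(size=(n_s, n_a))
nu = np.ones(n_s) / n_s
pi = np.ones((n_s, n_a)) / n_a
v = policy_value(P, R, gamma, pi)
print(mean_value(v, nu), bellman_residual_l1(P, R, gamma, v, nu))
```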
Related papers
Should one minimize the Bellman residual or maximize the mean value?
This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value (the predominant approach in policy search algorithms) and ii) minimization of the Bellman residual (mainly used in approximate dynamic programming). To do so, we introduce a new policy search algorithm based on the minimization of the resi...
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fixed-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the ob...
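For reference, a minimal sketch (not that paper's code) of the two policy-evaluation solutions it compares, for a known tabular model with linear features. Phi, P_pi, r_pi, gamma, and the state weights d are illustrative assumptions; when the true v_pi lies in the span of Phi, both solutions coincide with it.

```python
import numpy as np

def td0_fixed_point(Phi, P_pi, r_pi, gamma, d):
    """TD(0)/LSTD fixed point: solve Phi^T D (Phi - gamma P_pi Phi) theta = Phi^T D r_pi."""
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ D @ r_pi
    return np.linalg.solve(A, b)

def bellman_residual_min(Phi, P_pi, r_pi, gamma, d):
    """BR minimization: argmin_theta || (Phi - gamma P_pi Phi) theta - r_pi ||_D^2."""
    D = np.diag(d)
    M = Phi - gamma * P_pi @ Phi
    return np.linalg.solve(M.T @ D @ M, M.T @ D @ r_pi)
```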
Finite-sample Analysis of Bellman Residual Minimization
We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution ...
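A minimal sketch of an empirical Bellman residual for policy evaluation, assuming a generative model `sample_next(s, a, rng)` and states drawn i.i.d. from some distribution. The double-sampling trick (two independent next-state draws per state, so the squared-residual estimate stays unbiased) is one standard construction when a generative model is available; it is an assumption here, not necessarily that paper's exact setup, and all function names are hypothetical.

```python
import numpy as np

def empirical_bellman_residual(v, states, policy, reward, sample_next, gamma, rng):
    """Unbiased estimate of E[(T_pi v - v)(s)^2] over the sampled states, via double sampling."""
    total = 0.0
    for s in states:
        a = policy(s, rng)
        r = reward(s, a)
        s1 = sample_next(s, a, rng)   # first independent next-state draw
        s2 = sample_next(s, a, rng)   # second independent next-state draw
        d1 = r + gamma * v[s1] - v[s]
        d2 = r + gamma * v[s2] - v[s]
        total += d1 * d2              # product of independent TD errors: unbiased for the squared residual
    return total / len(states)
```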
APPLICATION OF THE BELLMAN AND ZADEH'S PRINCIPLE FOR IDENTIFYING THE FUZZY DECISION IN A NETWORK WITH INTERMEDIATE STORAGE
In most real-life applications we deal with the problem of transporting special fruits, such as bananas, which have particular production and distribution processes. In this paper we restrict our attention to formulating and solving a new bi-criterion problem on a network in which, in addition to minimizing the traversing costs, admissibility of the quality level of fruits is a main objecti...
Robust Value Function Approximation Using Bilinear Programming
Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Sol...